RBM
This article is about Restricted Boltzmann Machine as a recommender model. RBM stands for Restricted Boltzmann Machine. Notation Below: * The letters: ** u denotes a user ** i denotes an item ** j also denotes an item (if there are two items, they are denoted by i and j) ** r denotes rating (or anything else which a recommender tries to forecast, i. e. probability to select an item) ** R denotes the maximal rating ** f denotes a feature number ** F denotes the total number of features ** U denotes the number of users ** I denotes the number of items Model Each user also has a vector v of ratings v = (v_{11}, ..., v_{1R}, v_{21}, ..., v_{2R}, ..., v_{I1}, ..., v_{IR}) . Here v_{ir} is 1 if the item i received the rating r, otherwise 0. For instance, if the item 7's received 3 in the 1...5 rating system, then v_{73}=1 , while v_{71}=v_{72}=v_{74}=v_{75}=0 . Each user also has a hidden vector h\in \mathbb{R}^F . The hidden vector determines the probability distribution for the vector of ratings. However, as we'll see, the RBM model does not try to restore the hidden vectors explicitly, thats why the name. The joint distribution of v and h is: : P(v, h) = \exp(-E(v, h)) / \sum_{v', h'} \exp(-E(v', h')) (1.1a) where E(v,h) is called energy and defined as : E(v,h)= - \sum_{ir}b_{ir}v_{ir} - \sum_{f} b_f h_f - \sum_{ifr}W_{ifr} v_{ir} h_f + \sum_i \log Z_i (1.1b) : Z_i = \sum_r \exp\left(b_{ir}+\sum_f h_f W_{ifr}\right) (1.1c) Here b are called biases and W are called weights. It is biases and weights, not hidden vectors, which the learning process should restore. Missing (unrated) items are excluded from all sums. The term \sum_i \log Z_i is the normalization term that ensures \sum_r p(v_{ir}=1|h)=1 for each i. From the joint distribution of v and h, we can obtain conditional distributions and the distribution of v: : P(v_{ir}|h) = {\exp \left( b_{ir}+\sum_f h_f W_{ifr} \right) \over \sum_r \exp \left( b_{ir}+\sum_f h_f W_{ifr} \right)} (1.2) : P(h_f=1|v) = \sigma \left( b_f + \sum_i \sum_r v_{ir} W_{ifr} \right) (1.3) , :: where \sigma(t)=1/\left(1+e^{-t}\right) : P(v) = \sum_h {\exp \left( -E(v, h) \right) \over \sum_{v', h'} \exp \left( -E(v', h') \right)} (1.4) Learning Given the expression (1.4) for P(v) , learning should maximize the likelihood with respect to W and b , which is performed by the gradient ascent: : \Delta W_{ifr} = \epsilon {\partial \log P(v) \over \partial W_{ifr}} Denoting the gradient by W_{ifr} as \nabla_{ifr} , we have : \Delta W_{ifr} = \epsilon \nabla_{ifr} \log P(v) : \nabla_{ifr} \log P(v) = \nabla_{ifr} \log \sum_h \exp \left( -E(v, h) \right) - \nabla_{ifr} \log \sum_{v', h'} \exp \left( -E(v', h') \right) The first term is : \nabla_{ifr} \log \sum_h \exp \left( -E(v, h) \right) = : = {\nabla_{ifr} \sum_h \exp \left( -E(v, h) \right) \over \sum_h \exp \left( -E(v, h) \right)} = : = - {\sum_h \exp \left( -E(v, h) \right) \nabla_{ifr} E(v, h) \over \sum_h \exp \left( -E(v, h) \right)} = : = {\sum_h \exp \left( -E(v, h) \right) v_{ir} h_f \over \sum_h \exp \left( -E(v, h) \right)} = : = \left\langle v_{ir} h_f \right\rangle_{data} The notation \left\langle ... \right\rangle_{data} means that the expression in braces is calculated for given v and averaged over the conditional distribution P(h|v) . The second term is : \nabla_{ifr} \log \sum_{v', h'} \exp \left( -E(v', h') \right) = : = {\nabla_{ifr} \sum_{v', h'} \exp \left( -E(v', h') \right) \over \sum_{v', h'} \exp \left( -E(v', h') \right)} = : = - {\sum_{v', h'} \exp \left( -E(v', h') \right) \nabla_{ifr} E(v', h') \over \sum_{v', h'} \exp \left( -E(v', h') \right)} = : = {\sum_{v', h'} \exp \left( -E(v', h') \right) v^\prime_{ir} h^\prime_f \over \sum_{v', h'} \exp \left( -E(v', h') \right)} = : = \left\langle v_{ir} h_f \right\rangle_{model} The notation \left\langle ... \right\rangle_{model} means that the expression in braces is averaged over the joint distribution P(v, h) . : \Delta W_{ifr} = \epsilon \left( \left\langle v_{ir} h_f \right\rangle_{data} - \left\langle v_{ir} h_f \right\rangle_{model} \right) The first term can be calculated pretty easily. Given v, we can calculate the distribution for each element of h. But the second term needs exponential time to be calculated exactly. However, the function can be approximated by another function : \Delta W_{ifr} = \epsilon \left( \left\langle v_{ir} h_f \right\rangle_{data} - \left\langle v_{ir} h_f \right\rangle_T \right) This function is called "Contrastive Divergence" (CD) and proposed by Hinton in 2002. The notation \left\langle ... \right\rangle_{T} means averaging over the value running Gibbs sampler for T full steps. Below is the excerpt from the wikipedia.org article, modified: # Take a training sample v , compute the probabilities of the hidden units with (1.3), and sample a hidden activation vector h from this probability distribution. # Compute the outer product of v and h and call this the positive gradient. # Set h'=h # Repeat the following step T times: ## From h' , sample a reconstruction v' of the visible units using (1.2), then resample the hidden activations h' from this using (1.3). # Compute the outer product of v' and h' and call this the negative gradient. # Let the weight update to W_{ifr} , b_{ir} and b_f be the positive gradient minus the negative gradient, times the learning rate \epsilon In practice, T=3...10 is high enough. It is possible to start learning with small T and then increase it. Predictions After we have all the W and b , we are able to make predictions. Given a user u , a set of items with known ratings I(u) , and a set of known ratings v_{ir} , i \in I(u) , we are to predict the ratings v_{jr} for some j \notin I(u) : : p(v_{jr}=1) = {e^{b_{jr}} \prod_f \left(p(h_f=1|v)e^{-W_{jfr}}+p(h_f=0|v)\right) \over \sum_{r'} e^{b_{jr'}} \prod_f \left( p(h_f=1|v)e^{-W_{jfr'}}+p(h_f=0|v) \right) } , where p(h_f=1|v) is given by (1.3) which we reproduce here: : P(h_f=1|v) = \sigma \left( b_f + \sum_{i \in I(u)} \sum_r v_{ir} W_{ifr} \right) : \sigma(t)=1/\left(1+e^{-t}\right) Modifications Gaussian Hidden Units For Gaussian Hidden Units, the formula (1.1b) : E(v,h)= - \sum_{ir}b_{ir}v_{ir} - \sum_{f} b_f h_f - \sum_{ifr}W_{ifr} v_{ir} h_f + \sum_i \log Z_i is replaced by : E(v,h)= - \sum_{ir}b_{ir}v_{ir} - \sum_{f} {(h_f-b_f)^2 \over 2\sigma_f^2} - \sum_{ifr}W_{ifr} v_{ir} h_f + \sum_i \log Z_i (1.1b') This modifies (1.3) from : P(h_f=1|v) = \sigma \left( b_f + \sum_i \sum_r v_{ir} W_{ifr} \right) to : P(h_f=h|v) = {1 \over \sqrt{2\pi} \sigma_f} \exp \left( - { \left( h - b_f - \sigma_f \sum_{ir} v_{ir} W_{ifr} \right)^2 \over 2 \sigma_f^2} \right) In order to avoid updating the variances, they can be fixed as \sigma_f=1 . Conditional Factored RBM This approach factorizes the matrix of weights: : W_{ifr} = \sum_{k=1}^C A_{ikr} B_{kfr} Parameters For Netflix prize, the following parameters were selected: * F=100 for RBM * F=500, C=30 for conditional factored RBM * The weight used learning rate of 0.01/batch size, momemtum of 0.9, weight decay of 0.01 * T started as 1, then increased (to 9 for RBM, to 3 for conditional factored RBM). * After about 40-50 iterations (epochs), the model converged.